INTERSPEECH 2006 - Speech Synthesis

Total: 6

#1 Evaluating a virtual speech cuer

Authors: G. Gibert ; Gérard Bailly ; F. Elisei

This paper presents the virtual speech cuer built in the context of the ARTUS project, which aims at watermarking hand and face gestures of a virtual animated agent into a broadcast audiovisual sequence. For deaf televiewers who master cued speech, the animated agent can then be superimposed, on demand and at the receiving end, on the original broadcast as an alternative to subtitling. The paper presents the multimodal text-to-speech synthesis system and a first evaluation performed by deaf users.

#2 Intelligibility of machine translation output in speech synthesis

Authors: Laura Mayfield Tomokiyo ; Kay Peterson ; Alan W. Black ; Kevin A. Lenzo

One use of text-to-speech synthesis (TTS) is as a component of speech-to-speech translation systems. The output of automatic machine translation (MT) can vary widely in quality, however. A synthetic voice that is extremely intelligible on naturally-occurring text may be far less intelligible when asked to render text that is automatically generated. In this paper, we compare the quality of synthesis of naturally-occurring text and its MT counterpart. We find that intelligibility of TTS on MT output is significantly lower than on either naturally-occurring text or semantically unpredictable sentences, and explore the reasons why.
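Intelligibility in listening tests of this kind is commonly scored by having listeners transcribe what they hear and computing word error rate against the reference text; that is a general assumption about the methodology, not a detail confirmed by the abstract. A minimal sketch of such scoring:

```python
# Hedged sketch: word error rate between a listener's transcription and
# the reference text, one common intelligibility metric (the paper's
# exact metric may differ).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein word distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```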

#3 A technique for controlling voice quality of synthetic speech using multiple regression HSMM

Authors: Makoto Tachibana ; Takashi Nose ; Junichi Yamagishi ; Takao Kobayashi

This paper describes a technique for controlling the voice quality of synthetic speech using a multiple regression hidden semi-Markov model (HSMM). In this technique, we assume that the mean vectors of the output and state duration distributions of the HSMM are modeled by multiple regression with a parameter vector called the voice quality control vector. We first choose three features for controlling voice quality, namely "smooth voice - non-smooth voice," "warm - cold," and "high-pitched - low-pitched," and then attempt to control the voice quality of synthetic speech along these features. The results of several subjective tests show that the proposed technique can change these features of voice quality intuitively.
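The abstract's core assumption, that each mean vector is a multiple-regression function of a low-dimensional control vector, can be illustrated with a short sketch. The dimensions, names, and values below are hypothetical placeholders, not the paper's trained parameters:

```python
import numpy as np

# Illustrative sketch of the multiple-regression idea behind MR-HSMM:
# each mean vector of an output (or duration) distribution is a linear
# function of a low-dimensional voice quality control vector z, e.g.
# z = (smooth/non-smooth, warm/cold, high/low-pitched).

rng = np.random.default_rng(0)
dim_mean = 75        # spectral feature dimension (hypothetical)
dim_control = 3      # the three voice-quality axes from the paper

# Regression matrix H and bias b would be estimated during training;
# here they are random placeholders.
H = rng.normal(size=(dim_mean, dim_control))
b = rng.normal(size=dim_mean)

def controlled_mean(z: np.ndarray) -> np.ndarray:
    """Mean vector for control vector z: mu(z) = H z + b."""
    return H @ z + b

# Zeroing all axes yields the "neutral" voice; pushing one axis shifts
# the synthesized voice along that perceptual dimension.
neutral = controlled_mean(np.zeros(dim_control))
warmer = controlled_mean(np.array([0.0, 1.0, 0.0]))
```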

#4 Learning from errors in grapheme-to-phoneme conversion

Authors: Tatyana Polyakova ; Antonio Bonafonte

In speech technology, it is very important to have a system capable of accurately performing grapheme-to-phoneme (G2P) conversion. This is not an easy task, especially for languages like English, where there is no obvious letter-to-phone correspondence. The manual rules so widely used in the past are now giving way to machine learning techniques and language-independent tools. In this paper, we present an extension of the transformation-based error-driven learning algorithm to the G2P task. A set of explicit rules was inferred to correct the pronunciations for U.S. English, Spanish, and Catalan produced by well-known machine learning techniques, used in combination with the transformation-based algorithm. All methods applied in combination with the transformation rules significantly outperform the results obtained by these methods alone.
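Transformation-based error-driven learning (in the style of Brill tagging) starts from an initial prediction and applies an ordered list of learned correction rules. A minimal sketch of the correction step, with a toy rule and word that are illustrative only:

```python
# Hedged sketch of transformation-based correction for G2P: start from an
# initial phoneme prediction, then apply an ordered list of learned rules
# of the form "change phone X to Y after left context C". The rule and
# word below are toy examples, not from the paper.

from typing import List, Tuple

# Each rule: (target_phone, replacement, left_context_phone)
Rule = Tuple[str, str, str]

rules: List[Rule] = [
    ("ae", "ey", "m"),   # toy rule: 'ae' after 'm' becomes 'ey'
]

def apply_rules(phones: List[str], rules: List[Rule]) -> List[str]:
    """Apply each transformation rule, in order, over the phone sequence."""
    out = list(phones)
    for target, repl, left in rules:
        for i in range(1, len(out)):
            if out[i] == target and out[i - 1] == left:
                out[i] = repl
    return out

# Initial (erroneous) prediction from some baseline G2P model:
initial = ["m", "ae", "k"]
print(apply_rules(initial, rules))   # -> ['m', 'ey', 'k']
```

During training, the rule that most reduces the remaining error on the training data is selected greedily, appended to the list, and the process repeats until no candidate rule improves accuracy.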

#5 Eigenvoice conversion based on Gaussian mixture model

Authors: Tomoki Toda ; Yamato Ohtani ; Kiyohiro Shikano

This paper describes a novel framework for voice conversion (VC), which we call eigenvoice conversion (EVC). We apply EVC to the conversion from a source speaker’s voice to arbitrary target speakers’ voices. Using multiple parallel data sets consisting of utterance pairs of the source speaker and multiple pre-stored target speakers, a canonical eigenvoice GMM (EV-GMM) is trained in advance. This conversion model enables us to flexibly control the speaker individuality of the converted speech by manually setting the weight parameters. In addition, the optimum weight set for a specific target speaker can be estimated using only speech data from that speaker, without any linguistic restrictions. We evaluate the performance of EVC with a spectral distortion measure. Experimental results demonstrate that EVC works very well even when only a few utterances of the target speaker are used for the weight estimation.
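The eigenvoice idea can be sketched in a few lines: each mixture component's target mean is a bias plus a weighted combination of shared eigenvoice basis vectors, so a new target speaker is captured by a single low-dimensional weight vector. All names and sizes below are hypothetical, not the paper's notation:

```python
import numpy as np

# Illustrative sketch of the EV-GMM structure: the target mean of each
# mixture component m is b_m + B_m w, where w is a low-dimensional weight
# vector shared across components. Values are random placeholders.

rng = np.random.default_rng(0)
num_mix, dim, num_eigen = 64, 24, 8   # hypothetical sizes

biases = rng.normal(size=(num_mix, dim))                    # b_m per mixture
eigenvectors = rng.normal(size=(num_mix, dim, num_eigen))   # B_m per mixture

def target_means(weights: np.ndarray) -> np.ndarray:
    """Target mean vectors for weight vector w: mu_m(w) = b_m + B_m w."""
    return biases + eigenvectors @ weights

# A new target speaker is represented by one weight vector, which the paper
# estimates from the speaker's speech alone; here we just pick one.
w = rng.normal(size=num_eigen)
means = target_means(w)   # shape: (num_mix, dim)
```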

#6 Generating time-constrained audio presentations of structured information

Authors: Brian Langner ; Rohit Kumar ; Arthur Chan ; Lingyun Gu ; Alan W. Black

Presenting complex information understandably using speech is a challenging task to do well. Significant limitations, both in the generation process and in human listeners’ capabilities, typically result in poorly understood speech. This work examines possible strategies for producing understandable spoken renderings of complex information within those limitations, as well as ways to improve systems so as to reduce the limitations’ impact. We discuss a simple user study that explores these strategies with complex structured information, and describe a spoken dialog system that will use this work to provide a more understandable speech interface to structured information.
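One simple strategy consistent with this goal, offered purely as an assumption rather than as the paper's method, is to estimate each item's spoken duration from its word count and keep the highest-priority items that fit a time budget:

```python
# Purely illustrative: fit structured information into a time budget by
# estimating spoken duration from word count and selecting items greedily
# by priority. The speaking rate and item structure are assumptions, not
# details from the paper.

WORDS_PER_SECOND = 2.5   # rough average speaking rate (assumption)

def fit_to_budget(items, budget_seconds):
    """Select (text, priority) items by priority until the budget is used."""
    selected, used = [], 0.0
    for text, priority in sorted(items, key=lambda it: -it[1]):
        duration = len(text.split()) / WORDS_PER_SECOND
        if used + duration <= budget_seconds:
            selected.append(text)
            used += duration
    return selected

flights = [
    ("Flight 101 departs at 9 am and costs 200 dollars", 3),
    ("Flight 202 departs at noon and costs 180 dollars", 2),
    ("Flight 303 has two stops and costs 150 dollars", 1),
]
print(fit_to_budget(flights, 10.0))
```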